Understanding how the events described or shown in multimedia content relate to one another is a critical component of developing robust AI systems that can operate on real-world media. While much research has been devoted to event understanding in the text, image, and video domains individually, none has explored the complex relations that events exhibit across domains. For example, a news article may describe a "protest" event while a video shows an "arrest" event. Recognizing that the visual "arrest" event is a subevent of the broader "protest" event is a challenging yet important problem that prior work has not explored. In this paper, we propose the new task of multimodal event relation recognition to identify such cross-modal event relations. We contribute a large-scale dataset consisting of 100K video-news article pairs, as well as a benchmark of densely annotated data. We also propose a weakly supervised multimodal method that integrates commonsense knowledge from an external knowledge base (KB) to predict rich multimodal event hierarchies. Experiments show that our model outperforms a number of competitive baselines on our proposed benchmark. We also perform a detailed analysis of our model's performance and suggest directions for future research.
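To make the task concrete, here is a minimal sketch of what a cross-modal event relation classifier could look like: pooled embeddings of a textual event mention and a visual event are fused and mapped to a relation label. The module names, dimensions, and relation inventory are illustrative assumptions, not the paper's actual model.

```python
# Minimal sketch (assumptions, not the paper's method): classify the relation
# between a text event embedding and a video event embedding.
import torch
import torch.nn as nn

RELATIONS = ["no_relation", "identical", "parent_of", "child_of"]  # assumed label set

class CrossModalEventRelationClassifier(nn.Module):
    def __init__(self, text_dim=768, video_dim=512, hidden=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        # Fuse both projections plus their elementwise product
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(RELATIONS)),
        )

    def forward(self, text_emb, video_emb):
        t = self.text_proj(text_emb)        # (B, hidden)
        v = self.video_proj(video_emb)      # (B, hidden)
        joint = torch.cat([t, v, t * v], dim=-1)
        return self.classifier(joint)       # (B, num_relations) logits
```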
Standard inference and training with Transformer-based architectures scale quadratically with input sequence length. This is prohibitively expensive for a variety of applications, especially web-page translation and query answering, so several approaches have recently been developed to speed up attention computation by enforcing different attention structures such as sparsity, low rank, or kernel-based approximations. In this work, we view attention computation as a nearest-neighbor retrieval problem and use decision-tree-based hierarchical navigation to reduce the retrieval cost per query token from linear in sequence length to nearly logarithmic. Based on such hierarchical navigation, we design Treeformer, which can use one of two efficient attention layers: TF-Attention and TC-Attention. TF-Attention computes attention in a fine-grained style, while TC-Attention is a coarse attention layer that also ensures the gradients are "dense". To optimize such challenging discrete layers, we propose a two-level bootstrapped training method. Using extensive experiments on standard NLP benchmarks, especially for long sequences, we demonstrate that our Treeformer architecture can be almost as accurate as the baseline Transformer while using 30x fewer FLOPs in the attention layer. Compared to Linformer, its accuracy can be up to 12% higher while using similar FLOPs in the attention layer.
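The core idea, restricting each query to the keys retrieved by tree navigation, can be sketched as follows. This is an illustrative toy version under stated assumptions (a simple variance-based splitting rule, no batching, masking, or gradient handling), not the paper's TF-/TC-Attention layers.

```python
# Minimal sketch: route each query down a binary tree over the keys, then
# attend only within the reached leaf, cutting per-query cost from O(n)
# to roughly O(log n + leaf_size). Splitting rule is an assumption.
import numpy as np

def build_tree(keys, idx, leaf_size=16):
    """Recursively split key indices by a threshold on one coordinate."""
    if len(idx) <= leaf_size:
        return {"leaf": idx}
    d = np.argmax(keys[idx].var(axis=0))       # highest-variance dimension
    thresh = keys[idx, d].mean()
    left = idx[keys[idx, d] <= thresh]
    right = idx[keys[idx, d] > thresh]
    if len(left) == 0 or len(right) == 0:      # degenerate split: stop
        return {"leaf": idx}
    return {"dim": d, "thresh": thresh,
            "left": build_tree(keys, left, leaf_size),
            "right": build_tree(keys, right, leaf_size)}

def retrieve(node, q):
    """Navigate one query vector to a leaf; return the key indices there."""
    while "leaf" not in node:
        node = node["left"] if q[node["dim"]] <= node["thresh"] else node["right"]
    return node["leaf"]

def tree_attention(Q, K, V, leaf_size=16):
    """Softmax attention restricted to each query's retrieved leaf."""
    tree = build_tree(K, np.arange(len(K)), leaf_size)
    out = np.zeros((len(Q), V.shape[1]))
    for i, q in enumerate(Q):
        idx = retrieve(tree, q)
        scores = K[idx] @ q / np.sqrt(K.shape[1])
        w = np.exp(scores - scores.max())
        out[i] = (w / w.sum()) @ V[idx]
    return out
```

Because the tree routing is discrete, gradients do not flow through the retrieval step itself, which is the difficulty the paper's two-level bootstrapped training method is designed to address.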
Semi-supervised learning is becoming increasingly important because it can combine data carefully labeled by humans with abundant unlabeled data to train deep neural networks. Classic semi-supervised learning methods that focus on transductive learning have not been fully exploited in the inductive framework followed by modern deep learning. The same holds for the manifold assumption: that similar examples should get the same prediction. In this work, we employ a transductive label propagation method that is based on the manifold assumption to make predictions on the entire dataset, and we use these predictions to generate pseudo-labels for the unlabeled data and train a deep neural network. At the core of the transductive method lies a nearest neighbor graph of the dataset that we create based on the embeddings of the same network. Our learning process therefore iterates between these two steps. We improve performance on several datasets, especially in the few-labels regime, and show that our work is complementary to the current state of the art.
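A minimal sketch of the propagation step might look like the following: build a kNN graph from the network's embeddings, diffuse the known labels over it (a Zhou et al.-style propagation), and take the argmax as pseudo-labels. The affinity kernel, hyperparameters, and solver choice here are assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: transductive label propagation over a kNN graph of embeddings.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy import sparse
from scipy.sparse.linalg import cg

def propagate_labels(embeddings, labels, labeled_mask, k=50, alpha=0.99):
    """labels: int class ids; entries outside labeled_mask are ignored."""
    n = len(embeddings)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    dist, idx = nn.kneighbors(embeddings)
    # Gaussian affinities to the k nearest neighbors (column 0 is self)
    sigma = dist[:, 1:].mean()
    w = np.exp(-(dist[:, 1:] ** 2) / (2 * sigma ** 2))
    rows = np.repeat(np.arange(n), k)
    W = sparse.csr_matrix((w.ravel(), (rows, idx[:, 1:].ravel())), shape=(n, n))
    W = W.maximum(W.T)                            # symmetrize the graph
    d = np.asarray(W.sum(axis=1)).ravel()
    Dinv = sparse.diags(1.0 / np.sqrt(d + 1e-12))
    S = Dinv @ W @ Dinv                           # normalized adjacency
    num_classes = labels.max() + 1
    Y = np.zeros((n, num_classes))
    Y[labeled_mask, labels[labeled_mask]] = 1.0   # one-hot seeds on labeled points
    # Solve (I - alpha * S) F = Y, one class column at a time
    A = sparse.eye(n) - alpha * S
    F = np.column_stack([cg(A, Y[:, c])[0] for c in range(num_classes)])
    return F.argmax(axis=1)                       # pseudo-labels for all points
```

In the iterative loop described above, these pseudo-labels would then supervise the next round of network training, after which the embeddings (and hence the graph) are recomputed.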
Image descriptors based on activations of Convolutional Neural Networks (CNNs) have become dominant in image retrieval due to their discriminative power, compactness of representation, and search efficiency. Training of CNNs, either from scratch or fine-tuning, requires a large amount of annotated data, where a high quality of annotation is often crucial. In this work, we propose to fine-tune CNNs for image retrieval on a large collection of unordered images in a fully automated manner. Reconstructed 3D models obtained by the state-of-the-art retrieval and structure-from-motion methods guide the selection of the training data. We show that both hard-positive and hard-negative examples, selected by exploiting the geometry and the camera positions available from the 3D models, enhance the performance of particular-object retrieval. CNN descriptor whitening discriminatively learned from the same training data outperforms commonly used PCA whitening. We propose a novel trainable Generalized-Mean (GeM) pooling layer that generalizes max and average pooling and show that it boosts retrieval performance. Applying the proposed method to the VGG network achieves state-of-the-art performance on the standard benchmarks: Oxford Buildings, Paris, and Holidays datasets.
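The GeM layer itself is compact: it computes \( f = \left(\frac{1}{|X|} \sum_{x \in X} x^{p}\right)^{1/p} \) per channel, so \(p = 1\) recovers average pooling and \(p \to \infty\) approaches max pooling, with \(p\) learned by backpropagation. Below is a short PyTorch rendering of that formula; the initial value \(p = 3\) and the clamping epsilon are assumptions.

```python
# Generalized-Mean (GeM) pooling: a learnable exponent p interpolates between
# average pooling (p = 1) and max pooling (p -> infinity).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))   # learned jointly with the CNN
        self.eps = eps

    def forward(self, x):                         # x: (B, C, H, W) activations
        x = x.clamp(min=self.eps).pow(self.p)     # clamp keeps the power stable
        return F.avg_pool2d(x, x.shape[-2:]).pow(1.0 / self.p).flatten(1)  # (B, C)
```

Dropped in place of a network's final max or average pooling layer, this produces one descriptor per channel, which is then whitened as described above.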